#Install usual data science libraries
!pip install numpy scipy matplotlib scikit-learn pandas rdkit seaborn plotly
#Install tmap
!pip install tmap-viz
#Download ESOL dataset
!mkdir data/
!wget https://raw.githubusercontent.com/schwallergroup/ai4chem_course/main/notebooks/04%20-%20Unsupervised%20Learning/data/esol.csv -O data/esol.csv
Week 4 tutorial 1 - AI 4 Chemistry
Table of contents
- Unsupervised learning: dimensionality reduction
- PCA
- t-SNE
- TMAP
0. Relevant packages
Scikit-learn
We will again use the scikit-learn package, which contains the PCA and TSNE methods that we will apply.
TMAP
TMAP is a powerful visualization method capable of representing high-dimensional datasets as a 2D tree. It can also be applied in domains other than chemistry. If you want to know more, you can check the original paper.
We first install the necessary libraries and download the corresponding dataset.
1. Dimensionality reduction
Dimensionality reduction is a fundamental concept in unsupervised learning that aims to reduce the number of features or variables in a high-dimensional dataset while preserving the most relevant information. This technique is particularly useful when dealing with large and complex datasets that contain many features. In addition, it can help scientists better understand the underlying structure and relationships in their data. Here are some of the most common dimensionality reduction methods:
- PCA (Principal Component Analysis)
- t-SNE (t-distributed Stochastic Neighbor Embedding)
- NMF (Non-Negative Matrix Factorization)
- UMAP (Uniform Manifold Approximation and Projection)
By reducing the dimensionality of the data, it is also possible to visualize and interpret the data more easily and to develop more efficient and accurate predictive models. In this notebook we will explore some of these methods.
2. PCA
PCA (Principal Component Analysis) is a popular unsupervised learning technique used for dimensionality reduction. It aims to transform high-dimensional data into a lower-dimensional space while preserving the most important information by identifying the principal components of the data. PCA is widely used in data analysis, visualization, and feature extraction.
Exercise 1: ESOL dataset dimensionality reduction with PCA
In this exercise, we will apply PCA to the 2048-dimensional fingerprints representing the molecules in the ESOL dataset. We will try to reduce this space to 2 dimensions and plot the resulting space. Normally, before applying PCA you have to standardize your data, but in this case it is not necessary as we use binary features.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import PandasTools
import numpy as np
#Load ESOL
esol = pd.read_csv('data/esol.csv')
### YOUR CODE #####
#Create a 'Molecule' column containing rdkit.Mol from each molecule
#Create Morgan fingerprints (r=2, nBits=2048) from Molecule column using apply()
####
esol.head()
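One possible way to complete the cell above (a sketch; it assumes the SMILES column in the ESOL file is named 'smiles', as used later in this notebook):
#Sketch: build RDKit Mol objects and 2048-bit Morgan fingerprints (radius 2)
esol['Molecule'] = esol['smiles'].apply(Chem.MolFromSmiles)
esol['fp'] = esol['Molecule'].apply(lambda mol: AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048))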
Now, we apply the PCA decomposition. You can check the documentation of the method here.
from sklearn.decomposition import PCA
#### YOUR CODE ####
#Create a PCA object with n_components=2
#Create a numpy array containing the fingerprints
#Apply the fit_transform method to the previous array and store it in coordinates
#Add PC1 and PC2 values to each row
####
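A possible completion of this cell (a sketch, reusing the 'fp' column created above; the names fp_array and coordinates are our choice):
#Sketch: 2-component PCA on the fingerprint matrix
pca = PCA(n_components=2)
fp_array = np.array([np.array(fp) for fp in esol['fp']])
coordinates = pca.fit_transform(fp_array)
esol['PC1'] = coordinates[:, 0]
esol['PC2'] = coordinates[:, 1]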
Let’s plot the data using PC1 and PC2
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(data=esol, x='PC1', y='PC2')
plt.title('ESOL PCA plot');
Finally, we will create a special category of labels to add to this plot. The labels will represent solubility categories. For the sake of simplicity, we will create 3 categories:
- Low: log solubility lower than -5
- Medium: log solubility between -5 and -1
- High: log solubility higher than -1
The only purpose of this classification is to add more information to the plot, so you can explore different interpretations of the reduced space you are representing.
#Create function to add labels
def solubility_class(log_sol):
    '''Return the corresponding label according to solubility value
    '''
    if log_sol < -5:
        return 'Low'
    elif log_sol > -1:
        return 'High'
    else:
        return 'Medium'
### YOUR CODE ####
#Add labels to the ESOL dataset by applying the previous function
#Create the PCA plot again including the new labels
#####
plt.title('ESOL PCA plot with solubility label');
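One possible completion of the previous cell (a sketch; the label column name 'sol_class' is our assumption, chosen to match the plotly example at the end of the notebook):
#Sketch: add the solubility labels and redo the PCA plot colored by label
esol['sol_class'] = esol['log solubility (mol/L)'].apply(solubility_class)
sns.scatterplot(data=esol, x='PC1', y='PC2', hue='sol_class')
plt.title('ESOL PCA plot with solubility label');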
As you can see, the label categories are quite mixed, and the plot does not clearly show a trend in our data. However, you may keep trying different visualizations (for example, you could add a third dimension to the plot by including PC3 and see whether a 3D representation gives more information).
3. t-SNE
t-distributed Stochastic Neighbor Embedding (t-SNE) is another widely used dimensionality reduction technique. In contrast to PCA, t-SNE is able to separate nonlinear data, so it is often better at capturing local structure and identifying clusters of data points with similar features. However, it is computationally more expensive than PCA and may not be suitable for very large datasets.
Exercise 2: ESOL dataset dimensionality reduction with t-SNE
In the following example, we will apply t-SNE to the previous dataset and compare the result to the PCA decomposition. You can check the documentation here.
from sklearn.manifold import TSNE
### YOUR CODE ####
#Create a tsne object with n_components=2 and random_state=42. The latter parameter is used to ensure
#reproducibility (this is a non-deterministic algorithm)
#get the fp array
#apply fit_transform() to the data
#create columns with the tSNE coordinates in the original df
#####
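A possible completion (a sketch, reusing the fp_array fingerprint matrix built in the PCA sketch above; the column names tSNE1 and tSNE2 are our choice):
#Sketch: 2-component t-SNE on the fingerprint matrix
tsne = TSNE(n_components=2, random_state=42)
tsne_coords = tsne.fit_transform(fp_array)
esol['tSNE1'] = tsne_coords[:, 0]
esol['tSNE2'] = tsne_coords[:, 1]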
Now, plot the results including the solubility label and compare them to the PCA plot. Do you observe differences between the algorithms?
### YOUR CODE ####
####
plt.title('ESOL t-SNE plot');
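One possible completion (a sketch, assuming the tSNE1, tSNE2 and sol_class columns created above):
#Sketch: t-SNE plot colored by solubility class
sns.scatterplot(data=esol, x='tSNE1', y='tSNE2', hue='sol_class')
plt.title('ESOL t-SNE plot');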
4. TMAP
Finally, we will use TMAP for visualizing our solubility dataset. A TMAP plot is constructed in 4 phases:
1. LSH forest indexing: data are indexed in an LSH forest data structure
2. kNN Graph Generation: data are clustered using a c-approximate kNN graph
3. MST Computation: a Minimum Spanning Tree is calculated
4. Layout generation of the MST
The corresponding representation displays the data as a tree in 2D, showing the relationships between the different points not only through the presence of clusters but also through the branches of the tree.
Below we show how to create a simple visualization of our data via TMAP.
import tmap as tm
#get fingerprints of molecules
fps = esol['fp'].values
#Transform fingerprints into tmap vectors
vec_fp = [tm.VectorUchar(fp) for fp in fps]
#Create MinHash encoder and LSH Forest
enc = tm.Minhash(512)
lf = tm.LSHForest(512, 128)
#add vec_fp to minhash encoder and then pass it to the LSH forest
lf.batch_add(enc.batch_from_binary_array(vec_fp))
lf.index()
# Configuration for the tmap layout
CFG = tm.LayoutConfiguration()
CFG.node_size = 1 / 50
#Compute graph
x, y, s, t, _ = tm.layout_from_lsh_forest(lf, CFG)
Once you have computed the graph, you can plot it. In our case, we use matplotlib.
#Create figure
fig, ax = plt.subplots(figsize=(12,12))
#Create a function to convert the solubility class to integers
def solubility_class_to_int(log_sol):
    '''Return the corresponding label according to solubility value
    '''
    if log_sol < -5:
        return 0
    elif log_sol > -1:
        return 1
    else:
        return 2
#apply previous function (we create an array that will be used in the plotting function)
esol['int_class'] = esol['log solubility (mol/L)'].apply(solubility_class_to_int)
#Plot edges
for i in range(len(s)):
    plt.plot(
        [x[s[i]], x[t[i]]],
        [y[s[i]], y[t[i]]],
        "k-",
        linewidth=0.5,
        alpha=0.5,
        zorder=1,
    )
#Plot the vertices
scatter = ax.scatter(x, y, c=esol['int_class'].values, cmap='Set1', s=2, zorder=2)
plt.tight_layout()
classes = ['Low', 'High', 'Medium'] #order must match the integer labels 0, 1, 2 returned by solubility_class_to_int
plt.legend(handles=scatter.legend_elements()[0], labels=classes)
plt.title('ESOL TMAP visualization')
plt.plot()
Alternatively, you can use other libraries like plotly or Faerun to plot the data in a more interactive mode.
#example using plotly
import plotly.express as px
fig = px.scatter(x=x, y=y, color=esol['sol_class'].values,
                 hover_name=esol['smiles'].values,
                 color_continuous_scale=['#FF0000','#4169E1','#2E8B57'], title='ESOL TMAP')
fig.show()
We recommend checking the original repo to see the different ways TMAP can be applied. Cheers to Daniel Probst for creating this great tool!